DTSA 5510 - Unsupervised Algorithms in Machine Learning - Final Project¶


Contents¶

  1. Introduction and overview of the problem
  2. Description of the data sources
  3. Data cleaning
  4. Exploratory data analysis
  5. Modeling
  6. Results and analysis
  7. Discussion and conclusion

In [1]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

from IPython.display import Image
from IPython.core.display import HTML 

import numpy as np
import pandas as pd
from itertools import permutations

import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import pprint
pp = pprint.PrettyPrinter()

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, KFold, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans, AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
from xgboost import XGBClassifier, plot_importance

1. Introduction and overview of the problem¶

The data set we will be looking at describes various aspects of development for 160 countries around the world. The fictitious scenario is that we are an NGO that needs to decide how to allocate its funding according to the data it has on the countries. Since this is an unlabeled data set, we would not be able to compare the performance of any unsupervised algorithms without bringing in a set of labels.

Here is the twist I'm adding to the scenario: as the leaders of the NGO, we have decided that the features in our data set are the right ones to use in allocating our funds, but we'd like to see whether the United Nations has existing classifications and designations that align with our chosen features. To determine this, we will join in data from two lists. First is the United Nations World Economic Situation and Prospects (WESP) designation, the commonly cited list of developed, developing and in-transition countries. Second is the United Nations Development Programme's Human Development Index (HDI), which rates countries as Very High, High, Medium and Low on a development scale measuring health, knowledge and standard-of-living dimensions.

The approach for this project is to take the unlabeled country data and use K-Means and Hierarchical Clustering to identify clusters in the data. We will then use the generated clusters as classes, and compare them to the two UN classifications. Finally, we will use XGBoost to build a classic supervised model trained on the two UN classifications.

The goal is not to see if we can replicate the UN classifications from our data, but rather to see if the patterns in our data lead to classifications similar to the UN lists. If our models generate similar results from our data set, we can infer that the UN likely uses similar data to derive its classifications. More likely, we will discover that our data does not lead to the same classifications as the UN, either because our models differ, our data differs, or the UN applies some subjective logic when classifying countries.



2. Description of the data sources¶

  • Country data. This data is available in a few places on Kaggle, but I retrieved it from here: https://github.com/pycaret/pycaret/blob/master/datasets/country-data.csv. It contains data on the human living conditions in each country like child mortality rates, life expectancy, percent of GDP spent on healthcare, inflation, and others.

  • The UN WESP data was retrieved from: https://www.un.org/en/development/desa/policy/wesp/wesp_current/2014wesp_country_classification.pdf. It contains, among other things, classifications of countries into Developing, In Transition and Developed.

  • The UN Human Development Index data was retrieved from: https://hdr.undp.org/data-center/human-development-index#/indicies/HDI. It categorizes countries into Very High, High, Medium and Low levels on the HDI scale.


3. Data cleaning¶

I was hoping to be able to simply join the three data sets on country name, but upon inspection this proved impossible. Country names were not reliably the same across the three lists. One discrepancy is the use of official vs. colloquial names, for example South Korea vs. Republic of Korea, North Korea vs. Democratic People's Republic of Korea, and Venezuela vs. Bolivarian Republic of Venezuela. Other discrepancies arose from the inconsistent use of diacritics and other special characters, for example Turkey vs. Türkiye. In the end I merged the three lists manually; luckily, the resulting merged list contained only 160 countries.

Other than that, the data is clean. There is no missing data to be imputed or duplicated rows to be resolved.
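Although the final merge was done by hand, a pandas outer join with `indicator=True` is one way such name mismatches can be surfaced programmatically. A minimal sketch on hypothetical rows (the names and classifications here are illustrative, not the actual cleaning code):

```python
import pandas as pd

# Two small lists whose country names only partially agree.
country_list = pd.DataFrame({'Country': ['South Korea', 'Turkey', 'Albania']})
wesp_list = pd.DataFrame({'Country': ['Republic of Korea', 'Türkiye', 'Albania'],
                          'WESP Classification': ['Developed', 'Developing', 'In transition']})

# An outer join keeps rows from both sides; indicator=True adds a '_merge'
# column ('left_only', 'right_only', or 'both') flagging the mismatches.
merged = country_list.merge(wesp_list, on='Country', how='outer', indicator=True)
mismatches = sorted(merged.loc[merged['_merge'] != 'both', 'Country'])
print(mismatches)  # → ['Republic of Korea', 'South Korea', 'Turkey', 'Türkiye']
```

Each flagged name can then be reconciled by hand, which is effectively what was done for the three lists here.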

In [2]:
country_data = pd.read_csv('data/Country-data_augmented.csv')
country_data
Out[2]:
Country Child Mortality Exports Health Imports Income Inflation Life Expectancy Fertility GDPP WESP Classification HDI Classification
0 Afghanistan 90.2 10.0 7.58 44.9 1610 9.44 56.2 5.82 553 Developing Low
1 Albania 16.6 28.0 6.55 48.6 9930 4.49 76.3 1.65 4090 In transition High
2 Algeria 27.3 38.4 4.17 31.4 12900 16.10 76.5 2.89 4460 Developing High
3 Angola 119.0 62.3 2.85 42.9 5900 22.40 60.1 6.16 3530 Developing Medium
4 Antigua and Barbuda 10.3 45.5 6.03 58.9 19100 1.44 76.8 2.13 12200 Developing High
... ... ... ... ... ... ... ... ... ... ... ... ...
155 Vanuatu 29.2 46.6 5.25 52.7 2950 2.62 63.0 3.50 2970 Developing Medium
156 Venezuela 17.1 28.5 4.91 17.6 16500 45.90 75.4 2.47 13500 Developing Medium
157 Vietnam 23.3 72.0 6.84 80.2 4490 12.10 73.1 1.95 1310 Developing High
158 Yemen 56.3 30.0 5.18 34.4 4480 23.60 67.5 4.67 1310 Developing Low
159 Zambia 83.1 37.0 5.89 30.9 3280 14.00 52.0 5.40 1460 Developing Medium

160 rows × 12 columns

In [3]:
# Split dependent and independent vars, label encode the categorical targets.
country_names = country_data['Country']
X = country_data.drop(['Country', 'WESP Classification', 'HDI Classification'], axis=1)
y = country_data[['WESP Classification', 'HDI Classification']].copy()
le_3 = LabelEncoder()
le_4 = LabelEncoder()
y['wesp_label'] = le_3.fit_transform(y['WESP Classification'])
y['hdi_label'] = le_4.fit_transform(y['HDI Classification'])
print(f'{country_data.shape=}')
print(f'{country_names.shape=}')
print(f'{X.shape=}')
print(f'{y.shape=}')
country_data.shape=(160, 12)
country_names.shape=(160,)
X.shape=(160, 9)
y.shape=(160, 4)
In [4]:
# Look for missing data
country_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160 entries, 0 to 159
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Country              160 non-null    object 
 1   Child Mortality      160 non-null    float64
 2   Exports              160 non-null    float64
 3   Health               160 non-null    float64
 4   Imports              160 non-null    float64
 5   Income               160 non-null    int64  
 6   Inflation            160 non-null    float64
 7   Life Expectancy      160 non-null    float64
 8   Fertility            160 non-null    float64
 9   GDPP                 160 non-null    int64  
 10  WESP Classification  160 non-null    object 
 11  HDI Classification   160 non-null    object 
dtypes: float64(7), int64(2), object(3)
memory usage: 15.1+ KB
In [5]:
# Look for duplicate country names.
country_data['Country'].nunique()
Out[5]:
160

4. Exploratory data analysis¶

Looking at the distribution of country labels in the WESP classification, we see that there are many developing countries and few in-transition countries. The definition of in-transition is a bit vague, and the UN documentation admits that there is quite a bit of overlap between the three categories. The histogram shows a more even distribution among the HDI classes.

The pairplots for WESP reveal that this could be a challenging clustering problem. The developed and in-transition countries are fairly well grouped, but the developing countries cover a wide range of values in almost all the dimensions, making it hard to separate the three groups.

The pairplots for HDI reveal a similar pattern, with definite concentrations appearing but no distinct clusters. I suspect we will find that this data is quite continuous and that classification is more a matter of drawing arbitrary thresholds than identifying completely distinct categories of countries, e.g., countries scoring 1–2 are Low, 3–5 Medium, 6–8 High, and 9–10 Very High.
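To illustrate the thresholding idea, here is a sketch using `pd.cut` on a made-up development score; the bins and labels are arbitrary and purely illustrative of how continuous data can be carved into ordered categories:

```python
import pandas as pd

# A hypothetical continuous development score for five countries.
scores = pd.Series([1.5, 3.2, 4.8, 7.0, 9.5])

# pd.cut assigns each score to a bin; edges are (0, 2], (2, 5], (5, 8], (8, 10].
labels = pd.cut(scores, bins=[0, 2, 5, 8, 10],
                labels=['Low', 'Medium', 'High', 'Very High'])
print(labels.tolist())  # → ['Low', 'Medium', 'Medium', 'High', 'Very High']
```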

Looking at the boxplots for individual features, separated by class, reveals similar insights. None of the features is cleanly separable by class; there is a lot of overlap. For the WESP classes, Developed seems fairly distinct, while In transition and Developing largely overlap. For the HDI classes, there is a nice linear progression from Low to Very High, which suggests those classes were derived from data similar to ours.

My intuition is that we will have reasonable success at aligning our clusters with the UN data, with more success on the HDI classes than the WESP classes.

In [6]:
wesp_cat_order = {
    'WESP Classification': ['Developing', 'In transition', 'Developed']
}
hdi_cat_order = {
    'HDI Classification': ['Low', 'Medium', 'High', 'Very High']
}
In [7]:
fig = px.histogram(country_data, x='WESP Classification', histnorm='probability density', category_orders=wesp_cat_order)
fig.show()
In [8]:
fig = px.histogram(country_data, x='HDI Classification', histnorm='probability density', category_orders=hdi_cat_order)
fig.show()
In [9]:
sns.pairplot(country_data, hue='WESP Classification');
In [10]:
sns.pairplot(country_data, hue='HDI Classification');
In [11]:
for col in X.columns:
    fig = px.box(country_data, x="WESP Classification", y=col, category_orders=wesp_cat_order)
    fig.show()
In [12]:
for col in X.columns:
    fig = px.box(country_data, x="HDI Classification", y=col, category_orders=hdi_cat_order)
    fig.show()

5. Modeling¶

We start with a K-Means model, applying the elbow method to determine an appropriate K. From the chart, any value between 3 and 5 would be reasonable. This is fortunate, since the two sets of labels we have contain 3 and 4 labels. We will continue the rest of the analysis using both the 3-category WESP labels and the 4-category HDI labels.
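As a complementary check to the elbow method (not applied to the country data in this notebook), the silhouette score can also guide the choice of K; a sketch on synthetic, well-separated blobs where K=4 is the true answer:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic clusters; the silhouette score should peak at K=4.
X_demo, _ = make_blobs(n_samples=300,
                       centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                       cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)
print(scores)
```

Unlike the elbow plot, the silhouette score gives a single number per K, so the best K can be read off directly rather than eyeballed.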

We then fit hierarchical clustering models with 3 and 4 clusters, to compare against the WESP and HDI labels respectively.

Finally we fit an XGBoost model to see how the two unsupervised models above compare to a supervised model.

This gives us a total of 6 models (3 types of model times 2 sets of labels). We compute weighted F1 scores and display a confusion matrix for each model and will discuss the results in the following section.

In [13]:
# Scaling
scaler = MinMaxScaler()
features = X.columns
X_scaled = pd.DataFrame(scaler.fit_transform(X))
X_scaled.columns = features
X_scaled.describe()
Out[13]:
Child Mortality Exports Health Imports Income Inflation Life Expectancy Fertility GDPP
count 160.000000 160.000000 160.000000 160.000000 160.000000 160.000000 160.000000 160.000000 160.000000
mean 0.176552 0.206024 0.310938 0.265554 0.135562 0.112954 0.757643 0.287549 0.123879
std 0.199770 0.139603 0.169209 0.140476 0.157512 0.099178 0.178836 0.242382 0.177567
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.025073 0.117269 0.193599 0.171094 0.021714 0.056210 0.638067 0.100552 0.010299
50% 0.081061 0.175551 0.280609 0.246554 0.079113 0.089363 0.811637 0.198738 0.044087
75% 0.290652 0.256595 0.426352 0.331787 0.185029 0.141715 0.884615 0.468454 0.140490
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
In [14]:
# KMeans elbow method
ssd = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42)
    km = km.fit(X_scaled)
    ssd.append(km.inertia_)
In [15]:
plt.plot(range(1, 11), ssd, 'bx-')
plt.xlabel('K')
plt.ylabel('Sum of squared distances')
plt.title('K vs SSD')
plt.show()
In [16]:
# KMeans predictions for 3 and 4 clusters
km_3 = KMeans(n_clusters=3, random_state=42)
y_pred = pd.DataFrame(km_3.fit_predict(X_scaled))
y_pred.columns = ['km_3']
In [17]:
km_4 = KMeans(n_clusters=4, random_state=42)
y_pred['km_4'] = km_4.fit_predict(X_scaled)
In [18]:
# Hierarchical clustering predictions for 3 and 4 clusters
hc_3 = AgglomerativeClustering(n_clusters=3, compute_distances=True).fit(X_scaled)
y_pred['hc_3'] = hc_3.labels_
In [19]:
hc_4 = AgglomerativeClustering(n_clusters=4, compute_distances=True).fit(X_scaled)
y_pred['hc_4'] = hc_4.labels_
In [20]:
# Source: https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html
def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)
In [21]:
# p only takes effect when truncate_mode is set; without it the full tree is drawn.
plot_dendrogram(hc_3, truncate_mode='level', p=3)
In [22]:
plot_dendrogram(hc_4, truncate_mode='level', p=4)
In [23]:
def label_permute_compare(y_true, y_pred, categories):
    """Search all mappings from category names to cluster ids and return the
    one that maximizes the weighted F1 score against the cluster labels."""
    best_score = 0
    best_labels = None
    n = len(categories)

    for label_order in permutations(range(n), n):
        cats = {categories[i]: label_order[i] for i in range(n)}
        label_candidate = [cats[x] for x in y_true]
        score = f1_score(label_candidate, y_pred, average='weighted')
        if score > best_score:
            best_score = score
            best_labels = label_order
    return best_labels, best_score
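The brute-force permutation search above is O(n!) in the number of clusters, which is fine for 3 or 4 but grows quickly. As a sanity check, and as a scalable alternative, the same cluster-to-class alignment can be found with the Hungarian algorithm via `scipy.optimize.linear_sum_assignment`; a toy sketch:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

# Toy example: the cluster ids are a relabeling of the true classes
# (cluster 2 <-> class 0, cluster 0 <-> class 1, cluster 1 <-> class 2).
y_true = np.array([0, 0, 1, 1, 2, 2])
clusters = np.array([2, 2, 0, 0, 1, 1])

# Build the contingency table, then find the assignment of cluster ids to
# class ids that maximizes agreement (negate because the solver minimizes).
cm = confusion_matrix(y_true, clusters)
row_ind, col_ind = linear_sum_assignment(-cm)
mapping = dict(zip(col_ind, row_ind))  # cluster id -> class id

aligned = np.array([mapping[c] for c in clusters])
print(aligned)  # → [0 0 1 1 2 2], identical to y_true
```

The Hungarian algorithm maximizes raw agreement rather than weighted F1, so on unbalanced data it may occasionally pick a different mapping than the permutation search; for well-matched clusterings the two agree.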
In [48]:
# Performance evaluation
categories_3 = ['Developed', 'Developing', 'In transition']
categories_4 = ['High', 'Low', 'Medium', 'Very High']
results = {}
In [50]:
# K-Means performance for 3-category WESP labels.
results['km_3'] = label_permute_compare(y['WESP Classification'], y_pred['km_3'], categories_3)
y_pred['km_3_classes'] = [categories_3[results['km_3'][0].index(i)] for i in y_pred['km_3']]
results['km_3']
Out[50]:
((2, 0, 1), 0.6113377980300101)
In [51]:
print(classification_report(y['WESP Classification'], y_pred['km_3_classes']))
               precision    recall  f1-score   support

    Developed       0.79      0.73      0.76        37
   Developing       1.00      0.43      0.60       107
In transition       0.20      1.00      0.33        16

     accuracy                           0.56       160
    macro avg       0.66      0.72      0.57       160
 weighted avg       0.87      0.56      0.61       160

In [52]:
cm = confusion_matrix(y['WESP Classification'], y_pred['km_3_classes'])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=categories_3)
disp.plot()
plt.show()
In [54]:
# Hierarchical clustering performance for 3-category WESP labels.
results['hc_3'] = label_permute_compare(y['WESP Classification'], y_pred['hc_3'], categories_3)
y_pred['hc_3_classes'] = [categories_3[results['hc_3'][0].index(i)] for i in y_pred['hc_3']]
results['hc_3']
Out[54]:
((0, 2, 1), 0.5584425625920472)
In [55]:
print(classification_report(y['WESP Classification'], y_pred['hc_3_classes']))
               precision    recall  f1-score   support

    Developed       0.73      0.65      0.69        37
   Developing       0.67      0.54      0.60       107
In transition       0.00      0.00      0.00        16

     accuracy                           0.51       160
    macro avg       0.46      0.40      0.43       160
 weighted avg       0.61      0.51      0.56       160

In [56]:
cm = confusion_matrix(y['WESP Classification'], y_pred['hc_3_classes'])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=categories_3)
disp.plot()
plt.show()
In [57]:
# K-Means cluster performance for 4-category HDI labels.
results['km_4'] = label_permute_compare(y['HDI Classification'], y_pred['km_4'], categories_4)

y_pred['km_4_classes'] = [categories_4[results['km_4'][0].index(i)] for i in y_pred['km_4']]

results['km_4']
Out[57]:
((1, 0, 3, 2), 0.5241771515945828)
In [58]:
print(classification_report(y['HDI Classification'], y_pred['km_4_classes']))
              precision    recall  f1-score   support

        High       0.54      0.93      0.68        40
         Low       0.63      1.00      0.77        29
      Medium       0.00      0.00      0.00        31
   Very High       1.00      0.40      0.57        60

    accuracy                           0.56       160
   macro avg       0.54      0.58      0.51       160
weighted avg       0.62      0.56      0.52       160

In [59]:
cm = confusion_matrix(y['HDI Classification'], y_pred['km_4_classes'])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=categories_4)
disp.plot()
plt.show()
In [60]:
# Hierarchical clustering performance for 4-category HDI labels.
results['hc_4'] = label_permute_compare(y['HDI Classification'], y_pred['hc_4'], categories_4)

y_pred['hc_4_classes'] = [categories_4[results['hc_4'][0].index(i)] for i in y_pred['hc_4']]

results['hc_4']
Out[60]:
((0, 1, 3, 2), 0.5370577998402374)
In [61]:
print(classification_report(y['HDI Classification'], y_pred['hc_4_classes']))
              precision    recall  f1-score   support

        High       0.45      0.97      0.61        40
         Low       0.68      0.93      0.78        29
      Medium       0.00      0.00      0.00        31
   Very High       0.97      0.48      0.64        60

    accuracy                           0.59       160
   macro avg       0.52      0.60      0.51       160
weighted avg       0.60      0.59      0.54       160

In [62]:
cm = confusion_matrix(y['HDI Classification'], y_pred['hc_4_classes'])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=categories_4)
disp.plot()
plt.show()
In [63]:
# Supervised training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=24)
In [64]:
xgb_3_scores = cross_val_score(XGBClassifier(random_state=42), X, y['wesp_label'], scoring='f1_weighted', cv=5)
print(f'Average 3-class xgb F1 score: {np.mean(xgb_3_scores):.3f}')
Average 3-class xgb F1 score: 0.881
In [65]:
xgb_3 = XGBClassifier(random_state=42)
xgb_3.fit(X_train, y_train['wesp_label'])
y_pred_sup = pd.DataFrame(le_3.inverse_transform(xgb_3.predict(X_test)))
y_pred_sup.columns = ['xgb_3']
In [66]:
print(classification_report(y_test['WESP Classification'], y_pred_sup['xgb_3']))
               precision    recall  f1-score   support

    Developed       0.89      0.89      0.89         9
   Developing       0.86      0.95      0.90        20
In transition       0.00      0.00      0.00         3

     accuracy                           0.84        32
    macro avg       0.58      0.61      0.60        32
 weighted avg       0.79      0.84      0.82        32

In [67]:
cm = confusion_matrix(y_test['WESP Classification'], y_pred_sup['xgb_3'])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=categories_3)
disp.plot()
plt.show()
In [68]:
xgb_4_scores = cross_val_score(XGBClassifier(random_state=42), X, y['hdi_label'], scoring='f1_weighted', cv=5)
print(f'Average 4-class xgb F1 score: {np.mean(xgb_4_scores):.3f}')
Average 4-class xgb F1 score: 0.754
In [69]:
xgb_4_scores
Out[69]:
array([0.76788836, 0.80849359, 0.71458333, 0.72291667, 0.7582651 ])
In [70]:
xgb_4 = XGBClassifier(random_state=42)
xgb_4.fit(X_train, y_train['hdi_label'])
y_pred_sup['xgb_4'] = le_4.inverse_transform(xgb_4.predict(X_test))
In [71]:
print(classification_report(y_test['HDI Classification'], y_pred_sup['xgb_4']))
              precision    recall  f1-score   support

        High       1.00      0.70      0.82        10
         Low       0.80      1.00      0.89         4
      Medium       0.67      0.67      0.67         6
   Very High       0.86      1.00      0.92        12

    accuracy                           0.84        32
   macro avg       0.83      0.84      0.83        32
weighted avg       0.86      0.84      0.84        32

In [72]:
cm = confusion_matrix(y_test['HDI Classification'], y_pred_sup['xgb_4'])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=categories_4)
disp.plot()
plt.show()

6. Results and analysis¶

The results of the modeling are summarized in the table below.

| Model | 3-category (WESP) F1 score | 4-category (HDI) F1 score |
|---|---|---|
| K-Means | 0.61 | 0.52 |
| Hierarchical Clustering | 0.56 | 0.54 |
| XGBoost | 0.82 | 0.84 |

Looking deeper into the classification reports reveals more interesting insights than the F1 scores alone. For example, on the WESP labels, all three models struggled with the In transition label, particularly in distinguishing it from the Developing label. This aligns with my intuition from the visualizations that there is a lot of overlap between those two groups.

Looking at the HDI labels, both unsupervised models had the most success with the Low label and struggled the most with the Medium label. This also aligns with the EDA, which showed that Low HDI countries appeared more separable. Although the numbers in each box are small because we have a small test set spread across a 4x4 grid, the XGBoost model seems to perform quite well on the HDI data, with respectable F1 scores for every category.


7. Discussion and conclusion¶

The question we originally posed was whether the country data we obtained could be clustered or modeled with XGBoost to achieve results similar to the two UN classification systems for countries. The results show that there is better-than-random agreement between the unsupervised models and the UN classes, but it is not very strong. This suggests that (unsurprisingly) the UN uses a different methodology and/or features to derive its country classifications, but that there is some overlap with the data we have used in this analysis. One would hope that the UN also uses expert opinion to make judgment calls when designating countries as in-transition, since there does not seem to be a purely data-driven reason for some of these categorizations.

In [ ]: